```r
# Fit the logistic regression model with selected variables, using their range
# for prediction and the median for the rest
# The first model was used to come up with the most "significant" variables
model_glm <- glm(Underdeveloped ~ .,
                 data = scaled_data_with_undev, family = "binomial")

# Create the grid for prediction
grid <- full_stats_df |>
  data_grid(
    mean_agr_land = median_agr_land,
    mean_life_exp = seq_range(c(-2.627905, 1.588857), n = 5),
    mean_elec_access = seq_range(c(-2.6785616, 0.7295024), n = 5),
    mean_fert_rate = seq_range(c(-1.274220, 3.082338), n = 20),
    mean_sanit = median_sanit,
    mean_inter = median_internet,
    mean_pop_growth = median_population_growth,
    mean_prim_school = median_primary_school,
    mean_total_unempl = median_total_unempl
  )

# Predict probabilities that a country is Underdeveloped
aug_model <- augment(model_glm, newdata = grid, se_fit = TRUE) |>
  mutate(.predprob = plogis(.fitted))
```

Introduction
While interning with the Sub-Saharan Africa Poverty Team at the World Bank Group last summer, I realized that the lack of data on underdeveloped countries is a problem that needs to be addressed. This shortage of data prevents extensive research from being conducted and, consequently, detailed solutions from being developed.
The purpose of this project is to "effectively" construct a classification model for underdeveloped countries, given a set of variables from The World Bank Data Repository. The initial selection of variables was challenging, since most data sets did not contain enough data for underdeveloped countries, with NA values overwhelming the corresponding rows. For that reason, I decided to focus on the 21st century and ignore data from before the year 2000. The two primary analytical approaches employed in this project are logistic regression and neural networks.
Variable Description
After loading all the appropriate data sets from the World Bank Repository, we end up with a data set that contains the means and standard deviations of the variables below:
| Variable | Description |
|---|---|
| agriculture_land | The share of land area that is arable, under permanent crops, and under permanent pastures |
| birth_life_exp | The number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life |
| electricity_access | The percentage of population with access to electricity |
| fertility_rate | The number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with age-specific fertility rates of the specified year |
| internet | The percentage of population that uses the internet |
| population_growth | Annual population growth rate for year t is the exponential rate of growth of midyear population from year t-1 to t, expressed as a percentage |
| primary_school_enrol | Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown |
| sanitation | Percentage of people using at least basic sanitation services, that is, improved sanitation facilities that are not shared with other households |
| total_unemployment | The share of the labor force that is without work but available for and seeking employment |
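To illustrate how such a per-country summary of means and standard deviations might be assembled, here is a minimal base-R sketch on a toy data frame; the column names and values are hypothetical stand-ins, not the actual World Bank pipeline:

```r
# Toy indicator data in long-by-country form (made-up values for illustration)
raw <- data.frame(
  country        = rep(c("A", "B"), each = 3),
  year           = rep(2000:2002, times = 2),
  fertility_rate = c(5.1, 5.0, 4.9, 1.8, 1.7, 1.7),
  internet       = c(2, 3, 4, 60, 65, 70)
)

# Per-country mean and standard deviation for each indicator
means <- aggregate(cbind(fertility_rate, internet) ~ country, data = raw, FUN = mean)
sds   <- aggregate(cbind(fertility_rate, internet) ~ country, data = raw, FUN = sd)
names(means)[-1] <- paste0("mean_", names(means)[-1])
names(sds)[-1]   <- paste0("sd_",   names(sds)[-1])

# One row per country: mean_* and sd_* columns side by side
full_stats <- merge(means, sds, by = "country")
```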
Logistic Regression Model Attempts
The initial attempt to create a classifier began with logistic regression models due to their interpretability, computational efficiency, and flexibility in handling diverse datasets. The process involved an exploratory analysis of the dataset, identifying significant variables through pairwise plots and model summaries. The models were trained on a normalized and scaled dataset, augmented with variable ranges for prediction. However, the visualizations showed that logistic regression models with different combinations of proxies were insufficient to capture the complexities of the data. This suggests limitations in using simple models to interpret the relationships between variables. Hence, further exploration of more complex modeling approaches is warranted to better understand these relationships and improve classification accuracy.
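The overall pattern described above — fit a binomial `glm()` on scaled predictors, then evaluate predicted probabilities over a grid that varies one variable and holds the others at their medians — can be sketched on a built-in dataset. This is an illustrative stand-in using `mtcars` and base R's `expand.grid()`, not the project's data or its `modelr` pipeline:

```r
# Illustrative sketch: classify transmission type (am) in mtcars from
# scaled predictors, mirroring the scaled-data + binomial glm() pattern
scaled <- as.data.frame(scale(mtcars[, c("wt", "hp")]))
scaled$am <- mtcars$am

fit <- glm(am ~ wt + hp, data = scaled, family = "binomial")

# Vary one predictor over its observed range, hold the other at its median
grid <- expand.grid(
  wt = seq(min(scaled$wt), max(scaled$wt), length.out = 10),
  hp = median(scaled$hp)
)

# Linear predictor -> probability via the inverse logit (cf. plogis(.fitted))
grid$predprob <- plogis(predict(fit, newdata = grid))
```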
```r
# Fit the logistic regression model with selected variables, using their range
# for prediction and the median for the rest
# The first model was used to include the variables that were significant,
# but less so
grid <- full_stats_df |>
  data_grid(
    mean_agr_land = median_agr_land,
    mean_life_exp = seq_range(c(-2.627905, 1.588857), n = 10),
    mean_elec_access = median_elec_access,
    mean_fert_rate = seq_range(c(-1.274220, 3.082338), n = 10),
    mean_sanit = median_sanit,
    mean_inter = median_internet,
    mean_pop_growth = seq_range(c(-2.022998, 4.438928), n = 3),
    mean_prim_school = median_primary_school,
    mean_total_unempl = median_total_unempl
  )

# Predict probabilities for the grid
aug_model <- augment(model_glm, newdata = grid, se_fit = TRUE) |>
  mutate(.predprob = plogis(.fitted))
```

```r
# Fit the logistic regression model with selected variables, using their range
# for prediction and the median for the rest
# The first model was used to include the variables that were significant,
# but less so
grid <- full_stats_df |>
  data_grid(
    mean_agr_land = median_agr_land,
    mean_life_exp = median_life_exp,
    mean_elec_access = median_elec_access,
    mean_fert_rate = seq_range(c(-1.308798, 2.850897), n = 10),
    mean_sanit = median_sanit,
    mean_inter = seq_range(c(-1.511381, 2.405766), n = 10),
    mean_pop_growth = seq_range(c(-2.221712, 3.779502), n = 3),
    mean_prim_school = median_primary_school,
    mean_total_unempl = median_total_unempl
  )

# Predict probabilities for the grid
aug_model <- augment(model_glm, newdata = grid, se_fit = TRUE) |>
  mutate(.predprob = plogis(.fitted))
```

Neural Network Application
Due to the disappointing results from the logistic regression models, this next attempt incorporates a much more complex classification model, a neural network: a machine learning technique inspired by the human brain's processing of data. The neural network, developed and trained in a Colab notebook, is applied to generate predictions for the complex dataset. Initially, the data is split into training and testing sets, followed by the construction of a neural network model with dense layers utilizing ReLU activation functions and a final layer employing a sigmoid function for binary classification of countries into the two development categories. The model achieved a remarkable 95.65% accuracy, and this promising performance is reflected in the visual representation of the results.
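The architecture described above (dense hidden layers with ReLU, a sigmoid output unit) reduces to a simple forward pass. The following base-R sketch illustrates that computation with made-up weights; it is not the trained Colab model:

```r
# One dense hidden layer (ReLU) feeding a single sigmoid output unit;
# all weights below are made up for illustration, not learned values
relu    <- function(x) pmax(x, 0)
sigmoid <- function(x) 1 / (1 + exp(-x))

forward <- function(x, W1, b1, w2, b2) {
  h <- relu(W1 %*% x + b1)      # hidden layer: affine map + ReLU
  sigmoid(sum(w2 * h) + b2)     # output unit: affine map + sigmoid
}

# A standardized feature vector (e.g., one country's scaled indicators)
x  <- c(0.5, -1.2, 2.0)
W1 <- matrix(c( 0.4, -0.3, 0.1,
               -0.2,  0.5, 0.6), nrow = 2, byrow = TRUE)
b1 <- c(0.1, -0.1)
w2 <- c(1.5, -0.7)
b2 <- 0.2

p <- forward(x, W1, b1, w2, b2)  # probability of the 'Underdeveloped' class
```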
The first map shows the actual classification of underdeveloped countries, shaded dark blue. The second map shows the predictions of our neural network, with the predicted underdeveloped countries highlighted in dark red.
```r
predictions <- read.csv("data/predicted_dataset.csv")
```

```r
plot_1 <- ggplot() +
  geom_polygon(data = world_df,
               mapping = aes(x = long, y = lat, group = group, label = region),
               fill = "grey") +
  geom_polygon(data = underdeveloped_map,
               mapping = aes(x = long, y = lat, group = group,
                             fill = Underdeveloped, label = region)) +
  labs(title = "Actual Representation of Underdeveloped Countries") +
  scale_fill_manual(values = c("0" = "grey", "1" = "darkblue")) +
  theme_minimal() +
  theme(legend.position = "none")

ggplotly(plot_1, tooltip = "label")
```

```r
plot_2 <- ggplot() +
  geom_polygon(data = world_df,
               mapping = aes(x = long, y = lat, group = group, label = region),
               fill = "grey") +
  geom_polygon(data = predicted_full_df,
               mapping = aes(x = long, y = lat, group = group,
                             fill = as.factor(Underdeveloped),
                             label = `Country Name`)) +
  labs(title = "Representation of Underdeveloped Countries using Predictions of Neural Network") +
  scale_fill_manual(values = c("0" = "grey", "1" = "darkred")) +
  theme_minimal() +
  theme(legend.position = "none")

ggplotly(plot_2, tooltip = "label")
```

Conclusion
While the maps above indicate that the network does an exceptional job on the classification task, the table below raises some concerns. The model classified 51 countries as underdeveloped, when there are only 42. On the other hand, it classified 176 countries as developed, close to the actual number of 185. This suggests that while the model performs relatively well in identifying developed countries, it struggles more with accurately classifying underdeveloped ones. Proportionally speaking, the model achieved a successful classification rate of 95.14% for developed countries, falling to 82.35% for underdeveloped ones. This result underscores the very problem this project set out to highlight: **there is not enough data for underdeveloped countries**. With insufficient data to work with, providing accurate and precise solutions for countries in the underdeveloped category seems far from reality.
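The quoted percentages follow from the counts in the table, under one reading of them: 176 of the 185 actual developed countries, and 42 correct out of the 51 countries predicted underdeveloped (assuming every actual underdeveloped country fell inside those 51 predictions — the full confusion matrix is not shown). A quick arithmetic check in R:

```r
# Counts from the summary table of actual vs. predicted classes
actual_underdev <- 42;  actual_dev <- 185
pred_underdev   <- 51;  pred_dev   <- 176

# Developed: 176 of 185 actual developed countries
rate_dev <- round(100 * pred_dev / actual_dev, 2)                 # 95.14

# Underdeveloped: 42 correct among the 51 predicted
rate_underdev <- round(100 * actual_underdev / pred_underdev, 2)  # 82.35
```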
```r
library(pander)

# Convert summaries to data frames
summary_actual_df <- as.data.frame(summary_actual)
summary_predicted_df <- as.data.frame(summary_predicted)

# Add row names for clarity
row.names(summary_actual_df) <- "Actual"
row.names(summary_predicted_df) <- "Predicted"
combined_summary <- rbind(summary_actual_df, summary_predicted_df)

# Print combined summary using pander
pander(combined_summary, caption = "Summary of Actual and Predicted Classes")
```

| | Underdeveloped | Developed |
|---|---|---|
| Actual | 42 | 185 |
| Predicted | 51 | 176 |
This project encourages people, myself included, to use it as a starting point and attempt to create a larger data set covering a wider range of variables, in order to produce a more accurate classifier. In addition, one can examine the sectors of a country where there is an urgent need for more data and suggest strategies that would result in the consolidation of useful insights, leading to more targeted and sufficient research.